Elevating commodity storage with the SALSA host translation layer
To satisfy increasing storage demands in both capacity and performance,
industry has turned to multiple storage technologies, including Flash SSDs and
SMR disks. These devices employ a translation layer that conceals the
idiosyncrasies of their mediums and enables random access. Device translation
layers are, however, inherently constrained: resources on the drive are scarce,
they cannot be adapted to application requirements, and they lack visibility across
multiple devices. As a result, the performance and durability of many storage
devices are severely degraded.
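To make the idea of a translation layer concrete, here is a minimal Python sketch of a log-structured mapping that turns random logical writes into sequential physical appends. It is an illustration only and not SALSA's design: the class name, method names, and block-level granularity are all assumptions, and garbage collection is omitted.

```python
class LogStructuredTranslationLayer:
    """Toy translation layer (hypothetical): random logical writes become sequential appends."""

    def __init__(self, num_physical_blocks):
        self.num_physical_blocks = num_physical_blocks
        self.log = [None] * num_physical_blocks  # physical medium, only ever written in order
        self.mapping = {}                        # logical block address -> physical block address
        self.write_pointer = 0                   # next free physical block

    def write(self, lba, data):
        """Append data at the write pointer and remap the logical address to it."""
        if self.write_pointer >= self.num_physical_blocks:
            raise RuntimeError("log full; a real layer would garbage-collect here")
        pba = self.write_pointer
        self.log[pba] = data
        self.mapping[lba] = pba  # any older copy of this logical block becomes stale
        self.write_pointer += 1
        return pba

    def read(self, lba):
        """Return the latest data for a logical address, or None if never written."""
        pba = self.mapping.get(lba)
        return None if pba is None else self.log[pba]

    def stale_blocks(self):
        """Count written physical blocks that no logical address points to any more."""
        return self.write_pointer - len(set(self.mapping.values()))


layer = LogStructuredTranslationLayer(num_physical_blocks=8)
first = layer.write(5, b"a")   # random logical addresses...
second = layer.write(2, b"b")
third = layer.write(5, b"c")   # ...overwrite of logical block 5
```

In this sketch the three writes land on consecutive physical blocks regardless of their logical addresses, and the overwritten copy of logical block 5 is left behind as stale space that a real layer would eventually have to reclaim.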
In this paper, we present SALSA: a translation layer that executes on the
host and allows unmodified applications to better utilize commodity storage.
SALSA supports a wide range of single- and multi-device optimizations and,
because it is implemented in software, can adapt to specific workloads. We
describe SALSA's design, and demonstrate its significant benefits using
microbenchmarks and case studies based on three applications: MySQL, the Swift
object store, and a video server.
Comment: Presented at 2018 IEEE 26th International Symposium on Modeling,
Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS)
Complementing user-level coarse-grain parallelism with implicit speculative parallelism
Multi-core and many-core systems are the norm in contemporary processor technology
and are expected to remain so for the foreseeable future. Parallel programming
is, thus, here to stay and programmers have to embrace it if they are to exploit such
systems for their applications. Programs using parallel programming primitives like
PThreads or OpenMP often exploit coarse-grain parallelism, because it offers a good
trade-off between programming effort and performance gain. Some parallel applications
show limited or no scaling beyond a number of cores. Given the abundant
number of cores expected in future many-cores, several cores would remain idle in such
cases while execution performance stagnates. This thesis proposes using cores that do
not contribute to performance improvement for running implicit fine-grain speculative
threads. In particular, we present a many-core architecture and protocols that allow
applications with coarse-grain explicit parallelism to further exploit implicit speculative
parallelism within each thread. We show that complementing parallel programs
with implicit speculative mechanisms offers significant performance improvements for
a large and diverse set of parallel benchmarks. Implicit speculative parallelism frees
the programmer from the additional effort to explicitly partition the work into finer
and properly synchronized tasks. Our results show that, for a many-core comprising
128 cores supporting implicit speculative parallelism in clusters of 2 or 4 cores, performance
improves on top of the highest scalability point by 44% on average for the
4-core cluster and by 31% on average for the 2-core cluster. We also show that this
approach often leads to better performance and energy efficiency compared to existing
alternatives such as Core Fusion and Turbo Boost. Moreover, we present a dynamic
mechanism to choose the number of explicit and implicit threads, which performs
within 6% of the static oracle selection of threads.
To improve energy efficiency, processors allow for Dynamic Voltage and Frequency
Scaling (DVFS), which enables changing their performance and power consumption
on-the-fly. We evaluate the amenability of the proposed explicit plus implicit threads
scheme to traditional power management techniques for multithreaded applications
and identify room for improvement. We thus augment prior schemes and introduce
a novel multithreaded power management scheme that accounts for implicit threads
and aims to minimize the Energy-Delay² product (ED²). Our scheme comprises two
components: a “local” component that tries to adapt to the different program phases
on a per explicit thread basis, taking into account implicit thread behavior, and a
“global” component that augments the local components with information regarding
inter-thread synchronization. Experimental results show a reduction of ED² of 8%
compared to having no power management, with an average reduction in power of
15% that comes at a minimal loss of performance of less than 3% on average.
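As a rough illustration of how the energy-delay-squared metric responds to such changes, the Python sketch below computes E × D² with E = P × D for a power ratio and a delay ratio relative to a baseline. The function names are hypothetical, and reading a 3% performance loss as a 3% longer execution time is an assumption. Because the reported figures are averages over benchmarks, this single-point calculation is not expected to reproduce the reported reduction exactly.

```python
def ed2(power, delay):
    """Energy-delay-squared product: E * D^2, with energy E = power * delay."""
    energy = power * delay
    return energy * delay ** 2


def relative_ed2(power_ratio, delay_ratio):
    """ED^2 of a managed run relative to a baseline normalized to power = delay = 1."""
    return ed2(power_ratio, delay_ratio) / ed2(1.0, 1.0)


# Assumed single-point scenario: 15% lower power, 3% longer execution time.
ratio = relative_ed2(power_ratio=0.85, delay_ratio=1.03)
print(f"relative ED^2: {ratio:.3f}")  # prints "relative ED^2: 0.929"
```

Since delay enters the metric cubed once energy is expanded (P × D³), even a small slowdown cancels a noticeable part of the power saving, which is why the metric penalizes performance loss heavily.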